Let's draw some maps. 🗺🧐
Let's start with Altair. When your dataset is large, it is nice to enable the JSON data transformer. Instead of embedding the whole dataset in the chart specification (and holding it in memory), it saves the transformed dataset to a temporary file and references it by URL. This makes the whole plotting process much more efficient. For more information, check out: https://altair-viz.github.io/user_guide/data_transformers.html
import altair as alt
# saving data into a file rather than embedding into the chart
alt.data_transformers.enable('json')
#alt.renderers.enable('notebook')
# alt.renderers.enable('jupyterlab')
alt.renderers.enable('default')
RendererRegistry.enable('default')
First, we need a dataset with geographical coordinates. The zipcodes dataset contains the location (latitude and longitude) of each zip code area.
from vega_datasets import data
zipcodes_url = data.zipcodes.url
zipcodes = data.zipcodes()
zipcodes.head()
| | zip_code | latitude | longitude | city | state | county |
|---|---|---|---|---|---|---|
| 0 | 00501 | 40.922326 | -72.637078 | Holtsville | NY | Suffolk |
| 1 | 00544 | 40.922326 | -72.637078 | Holtsville | NY | Suffolk |
| 2 | 00601 | 18.165273 | -66.722583 | Adjuntas | PR | Adjuntas |
| 3 | 00602 | 18.393103 | -67.180953 | Aguada | PR | Aguada |
| 4 | 00603 | 18.455913 | -67.145780 | Aguadilla | PR | Aguadilla |
Note that zip codes are identifiers rather than numbers, so let's explicitly load them as a categorical dtype (this also guards against losing the leading zeros):
zipcodes = data.zipcodes(dtype={'zip_code': 'category'})
zipcodes.head()
| | zip_code | latitude | longitude | city | state | county |
|---|---|---|---|---|---|---|
| 0 | 00501 | 40.922326 | -72.637078 | Holtsville | NY | Suffolk |
| 1 | 00544 | 40.922326 | -72.637078 | Holtsville | NY | Suffolk |
| 2 | 00601 | 18.165273 | -66.722583 | Adjuntas | PR | Adjuntas |
| 3 | 00602 | 18.393103 | -67.180953 | Aguada | PR | Aguada |
| 4 | 00603 | 18.455913 | -67.145780 | Aguadilla | PR | Aguadilla |
zipcodes.zip_code.dtype
CategoricalDtype(categories=['00501', '00544', '00601', '00602', '00603', '00604',
'00605', '00606', '00610', '00611',
...
'99919', '99921', '99922', '99923', '99925', '99926',
'99927', '99928', '99929', '99950'],
ordered=False)
Btw, you'll have fewer issues if you pass a URL instead of a DataFrame to alt.Chart.
Now that we have the dataset loaded, let's start drawing some plots. Say you don't know anything about map projections. What would you try with geographical data? Probably the simplest approach is to treat (longitude, latitude) as Cartesian coordinates and plot them directly.
alt.Chart(zipcodes_url).mark_circle().encode(
x='longitude:Q',
y='latitude:Q',
)
Actually, this is itself a map projection, called the equirectangular projection. This projection (almost a non-projection, really) is completely straightforward and requires no processing of the data, so it is often used to quickly explore geographical data. As you dig deeper, though, you'll want to think about which map projection fits your needs best. Don't just use the equirectangular projection without any thought!
Anyway, let's make it look slightly better by reducing the size of the circles and adjusting the aspect ratio.
Q: Can you adjust the circle size, width and height of the chart?
# Implement
alt.Chart(zipcodes_url).mark_circle(size=15).encode(
    x='longitude:Q',
    y='latitude:Q',
).properties(
    width=1000,
    height=300,
)
But a much better way is to explicitly specify that these are latitude/longitude coordinates by using longitude= and latitude= rather than x= and y=. If you do that, Altair automatically adjusts the aspect ratio.
Q: Can you try it?
# Implement
alt.Chart(zipcodes_url).mark_circle(size=8).encode(
    longitude='longitude:Q',
    latitude='latitude:Q',
).properties(
    width=1000,
    height=300,
)
Because the American empire is far-reaching and complicated, the information density of this map is very low (although interesting). A common projection for visualizing US data is AlbersUSA, which is based on the Albers (equal-area) projection, a standard used by the United States Geological Survey and the United States Census Bureau. AlbersUsa composites the US mainland, Alaska, and Hawaii into one view.
To use it, we call the project method and specify which variables are longitude and latitude.
Q: use the project method to draw the map in the AlbersUsa projection.
# Implement
alt.Chart(zipcodes_url).mark_circle(size=2).encode(
longitude='longitude:Q',
latitude='latitude:Q'
).project(
type='albersUsa'
).properties(
width=700,
height=400,
)
Now we're talking. 😎
Let's visualize the large-scale zipcode patterns. We can use the fact that the zipcodes are hierarchically organized. That is, the first digit captures the largest area divisions and the other digits are about smaller geographical divisions.
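Since zip codes are strings, extracting these hierarchical prefixes is just string slicing. A toy sketch with a few made-up rows, just to illustrate the idea before we do it inside Altair:

```python
# Toy rows (made up) mimicking the zipcodes dataset.
sample_zips = ["00501", "02139", "10001", "60601", "90210", "94105"]

# The first digit marks a large region of the country...
first_digit = sorted({z[:1] for z in sample_zips})
# ...and the first two digits a finer subdivision.
two_digits = sorted({z[:2] for z in sample_zips})

print(first_digit)  # ['0', '1', '6', '9']
print(two_digits)   # ['00', '02', '10', '60', '90', '94']
```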
Altair provides some data transformation functionalities. One of them is extracting a substring from a variable.
from altair.expr import datum, substring
alt.Chart(zipcodes_url).mark_circle(size=2).transform_calculate(
'first_digit', substring(datum.zip_code, 0, 1)
).encode(
longitude='longitude:Q',
latitude='latitude:Q',
color='first_digit:N',
).project(
type='albersUsa'
).properties(
width=700,
height=400,
)
For each row (datum), you take the zip_code variable, extract a substring (think of Python string slicing), and name the result first_digit. Now you can use this first_digit variable to color the circles. Also note that we declare first_digit as a nominal variable, not quantitative, to obtain a categorical colormap. But we can play with that too.
Q: Why don't you extract the first two digits, name it two_digits, and declare it as a quantitative variable? Any interesting patterns? What does it tell us about the history of the US?
# Implement
from altair.expr import datum, substring
alt.Chart(zipcodes_url).mark_circle(size=2).transform_calculate(
'two_digits', substring(datum.zip_code, 0, 2)
).encode(
longitude='longitude:Q',
latitude='latitude:Q',
color='two_digits:Q',
).project(
type='albersUsa'
).properties(
width=700,
height=400,
)
It seems that there are more zip codes / area divisions in the eastern US. I can't read much US history out of this, but my educated guess would be that the northeastern states have more divisions because of longer settlement and earlier-established local governments.
Q: also try it with declaring the first two digits as a categorical variable
# Implement
from altair.expr import datum, substring
alt.Chart(zipcodes_url).mark_circle(size=2).transform_calculate(
'two_digits', substring(datum.zip_code, 0, 2)
).encode(
longitude='longitude:Q',
latitude='latitude:Q',
color='two_digits:N',
).project(
type='albersUsa'
).properties(
width=700,
height=400,
)
Btw, you can always click "view source" or "open in Vega Editor" to look at the json object that defines this visualization. You can embed this json object on your webpage and easily put up an interactive visualization.
Q: Can you put a tooltip that displays the zipcode when you mouse-over? Example https://altair-viz.github.io/gallery/scatter_tooltips.html
# Implement
from altair.expr import datum, substring
alt.Chart(zipcodes_url).mark_circle(size=2).transform_calculate(
    'two_digits', substring(datum.zip_code, 0, 2)
).encode(
    longitude='longitude:Q',
    latitude='latitude:Q',
    color='two_digits:N',
    tooltip='zip_code:N'
).project(
type='albersUsa'
).properties(
width=700,
height=400,
)
Let's try some choropleths now. Vega datasets include US county/state boundary data (us_10m) and world country boundary data (world-110m). You can take a look at the boundaries on GitHub, which renders topoJSON files.
If you click "Raw", you can see the actual file, which is hard to read.
Essentially, each file is a large dictionary with the following keys.
usmap = data.us_10m()
usmap.keys()
dict_keys(['type', 'transform', 'objects', 'arcs'])
usmap['type']
'Topology'
usmap['transform']
{'scale': [0.003589294092944858, 0.0005371535195261037],
'translate': [-179.1473400003406, 17.67439566600018]}
This transform is used to quantize the data: coordinates are stored as integers (cheaper to store than floats) and converted back to floating-point degrees using the scale and translate values.
https://github.com/topojson/topojson-specification#212-transforms
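Per the TopoJSON specification linked above, a quantized position is decoded as `quantized[i] * scale[i] + translate[i]`. A quick sketch using the transform values shown above:

```python
# Decode a quantized TopoJSON position back to (longitude, latitude):
# decoded[i] = quantized[i] * scale[i] + translate[i]
transform = {
    'scale': [0.003589294092944858, 0.0005371535195261037],
    'translate': [-179.1473400003406, 17.67439566600018],
}

def decode(position, transform):
    return [q * s + t for q, s, t in
            zip(position, transform['scale'], transform['translate'])]

# The quantized origin [0, 0] maps back to the translate offset,
# i.e. the bottom-left corner of the data's bounding box.
print(decode([0, 0], transform))
```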
usmap['objects'].keys()
dict_keys(['counties', 'states', 'land'])
This data contains not only county-level boundaries but also state and land boundaries (the three objects).
usmap['objects']['land']['type'], usmap['objects']['states']['type'], usmap['objects']['counties']['type']
('MultiPolygon', 'GeometryCollection', 'GeometryCollection')
land is a MultiPolygon (a single object), while states and counties contain many geometries (one per state or county) because there are many states (counties). A state is represented as the set of arcs that define its boundary. Its id captures the identity of the state and is the key for linking to other datasets.
state1 = usmap['objects']['states']['geometries'][1]
state1
{'arcs': [[[10337]],
[[10342]],
[[10341]],
[[10343]],
[[10834, 10340]],
[[10344]],
[[10345]],
[[10338]]],
'id': 15,
'type': 'MultiPolygon'}
The arcs referred to here are defined in usmap['arcs'].
usmap['arcs'][:10]
[[[15739, 57220], [0, 0]], [[15739, 57220], [29, 62], [47, -273]], [[15815, 57009], [-6, -86]], [[15809, 56923], [0, 0]], [[15809, 56923], [-36, -8], [6, -210], [32, 178]], [[15811, 56883], [9, -194], [44, -176], [-29, -151], [-24, -319]], [[15811, 56043], [-12, -216], [26, -171]], [[15825, 55656], [-2, 1]], [[15823, 55657], [-19, 10], [26, -424], [-26, -52]], [[15804, 55191], [-30, -72], [-47, -344]]]
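Each arc is delta-encoded: per the TopoJSON spec, every point after the first is stored as an offset from the previous point (still in quantized integer coordinates). A small decoder sketch, applied to the second arc from the output above:

```python
def decode_arc(arc):
    """Undo delta encoding: accumulate offsets starting from the
    first point (still in quantized integer coordinates)."""
    x, y = 0, 0
    points = []
    for dx, dy in arc:
        x += dx
        y += dy
        points.append([x, y])
    return points

arc = [[15739, 57220], [29, 62], [47, -273]]
print(decode_arc(arc))  # [[15739, 57220], [15768, 57282], [15815, 57009]]
```

Notice the decoded endpoint [15815, 57009] is exactly where the next arc in the output starts; adjacent shapes share arcs instead of duplicating boundary points, which is what makes TopoJSON compact.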
It seems pretty daunting to work with this dataset directly, right? But fortunately, people have already built tools to handle such data.
# states
states = alt.topo_feature(data.us_10m.url, 'states')
# us counties
us_counties = alt.topo_feature(data.us_10m.url, 'counties')
states
UrlData({
format: TopoDataFormat({
feature: 'states',
type: 'topojson'
}),
url: 'https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/us-10m.json'
})
Q. Can you find a mark for geographical shapes from here https://altair-viz.github.io/user_guide/marks.html and draw the states?
# Implement
alt.Chart(states).mark_geoshape()
And then project it using the albersUsa?
# Implement
alt.Chart(states).mark_geoshape().project(type='albersUsa')
Can you do the same thing with counties and draw county boundaries? (hint: you have to use alt.topo_feature())
# Implement
alt.Chart(alt.topo_feature(data.us_10m.url, 'counties')).mark_geoshape().project(type='albersUsa')
Let's load some county-level unemployment data.
unemp_data = data.unemployment(sep='\t')
unemp_data.head()
| | id | rate |
|---|---|---|
| 0 | 1001 | 0.097 |
| 1 | 1003 | 0.091 |
| 2 | 1005 | 0.134 |
| 3 | 1007 | 0.121 |
| 4 | 1009 | 0.099 |
This dataset has an unemployment rate for each county. From when? I don't know. We don't care about data provenance here because the goal is to quickly try out a choropleth. But if you're working with a real dataset, you should be very sensitive about its provenance. Make sure you understand where the data came from and how it was processed.
Each rate is keyed by a county id. To combine the two datasets, we use a "lookup transform": https://vega.github.io/vega/docs/transforms/lookup/. Essentially, we use the id in the map data to look up the id field in unemp_data and bring in the rate variable. Then we can use rate to encode the color of the geoshape mark.
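Under the hood, a lookup transform is just a keyed join. A tiny stdlib sketch with made-up county ids and rates, showing roughly what Vega does for us:

```python
# Made-up geometries (keyed by id) and an id -> rate table.
map_rows = [{'id': 1001}, {'id': 1003}, {'id': 9999}]
unemp = {1001: 0.097, 1003: 0.091}

# For each geometry, look up its id and bring in the rate field.
joined = [{**row, 'rate': unemp.get(row['id'])} for row in map_rows]
print(joined)
```

Counties with no match (id 9999 here) end up with rate None and would be drawn in the default "missing value" color.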
alt.Chart(us_counties).mark_geoshape().project(
type='albersUsa'
).transform_lookup(
lookup='id',
from_=alt.LookupData(data.unemployment.url, 'id', ['rate'])
).encode(
color='rate:Q'
).properties(
width=700,
height=400
)
There you have it, a nice choropleth map. 😎
Although many geovisualizations use vector graphics, raster visualization is still useful, especially when you deal with images or lots of data points. Datashader is a package that aggregates and visualizes large amounts of data very quickly. Given a scene (visualization boundary, resolution, etc.), it quickly aggregates the data, produces pixels, and sends them to you.
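Conceptually, that aggregation step can be sketched in a few lines: bin every point into a fixed pixel grid and count per bin, so rendering cost depends on the canvas resolution rather than the number of points. A minimal pure-Python sketch (not datashader's actual implementation, which is numba-compiled):

```python
def aggregate(points, x_range, y_range, w, h):
    """Count points per pixel on a w x h canvas covering the given ranges."""
    grid = [[0] * w for _ in range(h)]
    (x0, x1), (y0, y1) = x_range, y_range
    for x, y in points:
        if x0 <= x < x1 and y0 <= y < y1:
            px = int((x - x0) / (x1 - x0) * w)
            py = int((y - y0) / (y1 - y0) * h)
            grid[py][px] += 1
    return grid

pts = [(0.1, 0.1), (0.15, 0.12), (0.9, 0.9)]
grid = aggregate(pts, (0.0, 1.0), (0.0, 1.0), 4, 4)
print(grid)  # two points land in the same pixel bin
```

Shading (tf.shade) then maps each bin count to a color, typically on a log scale so dense and sparse areas are both visible.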
To appreciate its power, we need a fairly large dataset. Let's use NYC taxi trip dataset on Kaggle: https://www.kaggle.com/kentonnlp/2014-new-york-city-taxi-trips You can download even bigger trip data from NYC open data website: https://opendata.cityofnewyork.us/data/
Ah, and you'll want to install datashader, bokeh, and holoviews first if you don't have them yet. If you already have them, make sure they are the latest versions:
pip install -U datashader bokeh holoviews
or
conda install datashader bokeh holoviews
%matplotlib inline
!pip install -U datashader bokeh holoviews
import pandas as pd
import datashader as ds
from datashader import transfer_functions as tf
from colorcet import fire
Collecting datashader
Collecting bokeh
Collecting datashape>=0.5.1
...
Successfully built bokeh datashape
ERROR: distributed 2021.7.0 has requirement cloudpickle>=1.5.0, but you'll have cloudpickle 1.3.0 which is incompatible.
ERROR: distributed 2021.7.0 has requirement dask==2021.07.0, but you'll have dask 2.12.0 which is incompatible.
Successfully installed bokeh-2.3.3 datashader-0.13.0 datashape-0.5.2 distributed-2021.7.0 fsspec-2021.6.1 locket-0.2.1 multipledispatch-0.6.0 partd-1.2.0
from google.colab import files
uploaded = files.upload()
Because the dataset is pretty big, let's use a small sample first. For this visualization, we only keep the dropoff locations.
ls
nyc_taxi.csv sample_data/
nyctaxi_small = pd.read_csv('nyc_taxi.csv', nrows=10000,
usecols=['dropoff_x', 'dropoff_y'])
nyctaxi_small.head()
| | dropoff_x | dropoff_y |
|---|---|---|
| 0 | -8.235438e+06 | 4.973928e+06 |
| 1 | -8.236276e+06 | 4.974675e+06 |
| 2 | -8.235815e+06 | 4.972018e+06 |
| 3 | -8.231489e+06 | 4.979778e+06 |
| 4 | -8.235296e+06 | 4.972574e+06 |
Although the dataset is different, we can still follow the example here: https://datashader.org/getting_started/Introduction.html
agg = ds.Canvas().points(nyctaxi_small, 'dropoff_x', 'dropoff_y')
tf.set_background(tf.shade(agg, cmap=fire),"black")
Why can't we see anything? Wait, do you see the small dots at the top left? Could that be New York City? Maybe we can't see anything because some trips go very far? Or because the dataset has some missing data?
Q: Can you first check whether there are NaNs? Then drop them and draw the map again?
# Implement: Check whether we have NaNs
# Check the DataFrame itself (not the aggregated canvas) for missing values.
nyctaxi_small.isnull().values.any()
False
# Implement: drop the rows with NaN and then draw the map again.
# (There were no NaNs here, so dropna() is a no-op, but it's a safe habit.)
nyctaxi_small = nyctaxi_small.dropna()
agg = ds.Canvas().points(nyctaxi_small, 'dropoff_x', 'dropoff_y')
tf.set_background(tf.shade(agg, cmap=fire), "black")
So it's not about the missing data.
Q: Can you identify the issue and draw the map like the following?
hint: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.between.html and histograms may be helpful.
# Implement. You can use multiple cells to figure out what's going on.
# Once you figure it out, create a new df nyctaxi_small_filtered where the issue is resolved
# A histogram of dropoff_x/dropoff_y shows a handful of extreme outliers far
# from NYC, which blow up the plot range. Keep only points in a box around NYC.
nyctaxi_small_filtered = nyctaxi_small[
    nyctaxi_small.dropoff_x.between(-8242000, -8210000)
    & nyctaxi_small.dropoff_y.between(4965000, 4990000)
]
print(type(nyctaxi_small_filtered))
<class 'pandas.core.frame.DataFrame'>
agg = ds.Canvas().points(nyctaxi_small_filtered, 'dropoff_x', 'dropoff_y')
tf.set_background(tf.shade(agg, cmap=fire), "black")
Do you see the black empty space at the center? That looks like the Central Park. This is cool, but it'll be awesome if we can explore the data interactively.
Q. Ok, now let's get serious and load the whole dataset. It may take some time. Apply the same data-cleaning procedure.
# Implement
nyctaxi = pd.read_csv('nyc_taxi.csv', usecols=['dropoff_x', 'dropoff_y'])
nyctaxi = nyctaxi[
    nyctaxi.dropoff_x.between(-8242000, -8210000)
    & nyctaxi.dropoff_y.between(4965000, 4990000)
]
nyctaxi.head()
| | dropoff_x | dropoff_y |
|---|---|---|
| 0 | -8.235438e+06 | 4.973928e+06 |
| 1 | -8.236276e+06 | 4.974675e+06 |
| 2 | -8.235815e+06 | 4.972018e+06 |
| 3 | -8.231489e+06 | 4.979778e+06 |
| 4 | -8.235296e+06 | 4.972574e+06 |
Can you feed the data directly to datashader to reproduce the static plot, this time with the full data?
# Implement
agg = ds.Canvas().points(nyctaxi, 'dropoff_x', 'dropoff_y')
tf.set_background(tf.shade(agg, cmap=fire),"black")
Wow, that's fast. Also it looks cool!
Let's try the interactive version from here: https://datashader.org/getting_started/Introduction.html
import holoviews as hv
from holoviews.element.tiles import EsriImagery
from holoviews.operation.datashader import datashade
hv.extension('bokeh')
map_tiles = EsriImagery().opts(alpha=0.5, width=900, height=480, bgcolor='black')
points = hv.Points(nyctaxi, ['dropoff_x', 'dropoff_y'])
taxi_trips = datashade(points, x_sampling=1, y_sampling=1, cmap=fire, width=900, height=480)
map_tiles * taxi_trips
Why does it say "map data not yet available"? The reason is the difference between two coordinate systems. If you google this error message, you can find https://stackoverflow.com/questions/44487898/map-background-with-datashader-map-data-not-yet-available.
You can use datashader.utils.lnglat_to_meters to convert your latitudes and longitudes to a format that holoviews understands. More on this here: https://datashader.org/user_guide/Geography.html
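For intuition, lnglat_to_meters computes the standard Web Mercator projection used by web map tiles. A minimal single-point reimplementation sketch (assuming the usual spherical earth radius; the real function works on whole arrays):

```python
import math

R = 6378137.0  # spherical earth radius (meters) used by web map tiles

def lnglat_to_web_mercator(lng, lat):
    """Web Mercator: the same formula datashader.utils.lnglat_to_meters
    applies elementwise, written out for a single point."""
    x = R * math.radians(lng)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y

# Roughly NYC; note the result lands on the same scale as the
# dropoff_x / dropoff_y columns we saw earlier (~ -8.2e6, ~ 4.97e6).
print(lnglat_to_web_mercator(-74.0, 40.7))
```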
Q: Can you draw an interactive map by converting the lnglat data to x, y coordinate explained above?
# Implement
from datashader.utils import lnglat_to_meters

# Convert the coordinates to Web Mercator meters so they line up with
# the map tiles, then datashade the points as before.
x, y = lnglat_to_meters(nyctaxi['dropoff_x'], nyctaxi['dropoff_y'])
projected = pd.DataFrame({'dropoff_x': x, 'dropoff_y': y})

map_tiles = EsriImagery().opts(alpha=0.5, width=900, height=480, bgcolor='black')
points = hv.Points(projected, ['dropoff_x', 'dropoff_y'])
taxi_trips = datashade(points, x_sampling=1, y_sampling=1, cmap=fire,
                       width=900, height=480)
map_tiles * taxi_trips
It's interactive! If you are running a bokeh server with a live Python process, the map quickly refreshes and shows more detail as you zoom.
Q: how many rows (data points) are we visualizing right now?
# figure it out
len(nyctaxi)
50000
That's a lot of data points. If we were using a vector format, it would probably be hopeless to expect any interactivity, because you'd need to move that many points around! Yet datashader + holoviews + bokeh renders everything almost in real time.
Another useful tool is Leaflet. It allows you to use various map tile data (Google Maps, OpenStreetMap, ...) with many types of marks (points, heatmaps, etc.). Leaflet.js is one of the easiest options for doing this on the web, and there is a Python bridge: https://github.com/jupyter-widgets/ipyleaflet. Although we won't go into details, it's certainly worth checking out if you're working with geographical data.